Linear Models for Predicting Continuous Values

Linear Structure

$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

Given $N$ observations, $\xv_n$, for $n=1,\ldots,N$, and target values, $t_n$, for $n=1,\ldots,N$, what is the simplest model, $g(\xv)$, you can think of?

$$ g(\xv) = 0 $$

or maybe

$$ g(\xv) = c $$
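
Under the squared-error criterion used below, the best constant $c$ turns out to be the mean of the target values. A quick numpy check with hypothetical targets:

import numpy as np

t = np.array([2.0, 3.0, 5.0, 10.0])   # hypothetical target values
c = t.mean()                          # the constant that minimizes the sum of squared errors
print(np.sum((t - c)**2))             # 38.0
print(np.sum((t - (c + 1))**2))       # 42.0; any other constant does worse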

What is the next simplest model?

$$ \begin{align*} g(\xv;\wv) &= w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_D x_D \\ &= w_0 + \sum_{i=1}^D w_i x_i \\ & = \sum_{i=0}^D w_i x_i \mbox{, where } x_0 = 1 \\ &= \wv^T \xv \end{align*} $$
  • This is nice because it is linear in the parameters $\wv$; optimizations based on derivatives might be solvable analytically.
  • This is not so nice, because it is also linear in the inputs, $\xv$; greatly limits the complexity of the model.
  • But, a model linear in the inputs might be the best you can do for many cases, such as a sparsely sampled distribution, process, population, thing...whatever it is you want to model.
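
As a small sketch of evaluating this model with numpy (the data values here are hypothetical), prepend $x_0 = 1$ to each sample and the predictions for all samples become a single matrix-vector product:

import numpy as np

X = np.array([[1.0, 2.0, 3.0],        # two hypothetical samples; the first column is x_0 = 1
              [1.0, 4.0, 5.0]])
w = np.array([0.5, -1.0, 2.0])        # w_0, w_1, w_2
print(np.dot(X, w))                   # [ 4.5  6.5], that is, w^T x_n for each sample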

Fitting Data Samples with a Linear Model

Imagine connecting each data point to a rigid rod (our fitted line) with a spring. The force exerted by a spring is proportional to its length, and the potential energy stored in it is proportional to the square of its length. Say we want the rod to settle at the position that minimizes the total potential energy in the springs. $$ \begin{align*} \sum_{n=1}^N (t_n - g(\xv_n;\wv))^2 \end{align*} $$

If $g$ is an affine (linear + constant) function of $x$, $$ g(\xv;\wv) = w_0 + w_1 x $$ with parameters $\wv = (w_0, w_1)$, which parameter values give best fit? $$ \wv_{\mbox{best}} = \argmin{\wv} \sum_{n=1}^N (t_n - g(x_n ; \wv))^2 $$

Set derivative (gradient) with respect to $\wv$ to zero and solve for $\wv$. Let's do this with matrices.

Collect all targets into matrix $T$ and $x$ samples into matrix $X$. ($N$=number samples, $D$=sample dimensionality) $$ \begin{align*} T &= \begin{bmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{bmatrix} \\ X &= \begin{bmatrix} x_{1,0} & x_{1,1} & x_{1,2} & \dotsc & x_{1,D} \\ x_{2,0} & x_{2,1} & x_{2,2} & \dotsc & x_{2,D} \\ \vdots \\ x_{N,0} & x_{N,1} & x_{N,2} & \dotsc & x_{N,D} \end{bmatrix}\\ \wv &= \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_D \end{bmatrix} \end{align*} $$

Collection of all differences is $T - X\wv$, which is an $N \times 1$ matrix. To form the square of all values and add them up, just do a dot product $(T-X\wv)^T (T-X\wv)$. This only works if the value we are predicting is a scalar, which means $T$ is a column matrix. If we want to predict more than one value for each sample, $T$ will have more than one column. Let's continue with the derivation assuming $T$ has $K$ columns, meaning we want a linear model with $K$ outputs.
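
As a quick sanity check in code (a sketch with hypothetical shapes and arbitrary values), the dot-product form gives the same number as summing the squared differences element by element:

import numpy as np

N, D = 5, 3
X = np.random.uniform(size=(N, D + 1))   # hypothetical data, with the column of 1's included
T = np.random.uniform(size=(N, 1))       # one target value per sample
w = np.zeros((D + 1, 1))                 # any weight values will do for this check

diff = T - np.dot(X, w)                  # N x 1 matrix of differences
print(np.dot(diff.T, diff))              # sum of squared errors as a dot product (a 1 x 1 matrix)
print(np.sum(diff**2))                   # the same value, computed element by element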

To find the best value for $\wv$, we take the derivative of the sum of squared errors objective, set it equal to zero, and solve for $\wv$.

$$ \begin{align*} \frac{\partial \sum_{n=1}^N (\tv_n - g(\xv_n;\wv))^2}{\partial \wv} &= -2 \sum_{n=1}^N (\tv_n - g(\xv_n;\wv)) \frac{\partial g(\xv_n;\wv)}{\partial \wv}\\ &= -2 \sum_{n=1}^N (\tv_n - \xv_n^T \wv) \frac{\partial \xv_n^T \wv}{\partial \wv}\\ &= -2 \sum_{n=1}^N (\tv_n - \xv_n^T \wv) \xv_n^T \end{align*} $$

Here's where we get the benefit of expressing the $\xv_n$ and $t_n$ samples as matrices. The sum can be performed with a dot product:

$$ \begin{align*} \frac{\partial \sum_{n=1}^N (\tv_n - g(\xv_n;\wv))^2}{\partial \wv} &= -2 \sum_{n=1}^N (\tv_n - \xv_n^T \wv) \xv_n^T\\ &= -2 \Xv^T (\Tv - \Xv \wv) \end{align*} $$

Check the sizes and shapes of each matrix in the last equation above.
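
To make the bookkeeping concrete: $\Xv$ is $N \times (D+1)$, $\wv$ is $(D+1) \times K$, so $\Xv\wv$ and $\Tv - \Xv\wv$ are $N \times K$, and $\Xv^T (\Tv - \Xv\wv)$ is $(D+1) \times K$, the same shape as $\wv$. A small sketch with hypothetical sizes:

import numpy as np

N, D, K = 5, 3, 2                        # hypothetical numbers of samples, inputs, and outputs
X = np.random.uniform(size=(N, D + 1))
T = np.random.uniform(size=(N, K))
w = np.zeros((D + 1, K))
grad = -2 * np.dot(X.T, T - np.dot(X, w))
print(grad.shape)                        # (4, 2), the same shape as w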

Now we can set this equal to zero and solve for $\wv$.

$$ \begin{align*} -2 \Xv^T (\Tv - \Xv \wv) &= 0\\ \Xv^T (\Tv - \Xv \wv) &= 0\\ \Xv^T \Tv &= \Xv^T \Xv \wv\\ \wv &= (\Xv^T \Xv)^{-1} \Xv^T \Tv \end{align*} $$

And, in python

w = np.dot( np.linalg.inv(np.dot(X.T, X)), np.dot(X.T,T) )

or, you may use the solve function that assumes $\Xv^T \Xv$ is full rank (no linearly dependent columns),

w = np.linalg.solve(np.dot(X.T,X), np.dot(X.T, T))

or, better yet, use the lstsq function that does not make that assumption.

w = np.linalg.lstsq(np.dot(X.T,X), np.dot(X.T, T))[0]  # lstsq returns a tuple; the weights are its first element

Application

Try the automobile data set at the UCI Machine Learning Repository. Download auto-mpg.data and auto-mpg.names from the "Data Folder" link near the top of the web page.


In [1]:
!wget http://mlr.cs.umass.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
!wget http://mlr.cs.umass.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.names


/bin/sh: wget: command not found
/bin/sh: wget: command not found

First, take a look at auto-mpg.names. There you will learn that there are 398 samples, each with 8 numerical attributes and one string attribute. Their names are

  1. mpg: continuous
  2. cylinders: multi-valued discrete
  3. displacement: continuous
  4. horsepower: continuous
  5. weight: continuous
  6. acceleration: continuous
  7. model year: multi-valued discrete
  8. origin: multi-valued discrete
  9. car name: string (unique for each instance)

First step, read it into python and look at it.


In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Take a look at a few lines of the data file to figure out how to read it in.


In [3]:
!head -40 auto-mpg.data


18.0   8   307.0      130.0      3504.      12.0   70  1	"chevrolet chevelle malibu"
15.0   8   350.0      165.0      3693.      11.5   70  1	"buick skylark 320"
18.0   8   318.0      150.0      3436.      11.0   70  1	"plymouth satellite"
16.0   8   304.0      150.0      3433.      12.0   70  1	"amc rebel sst"
17.0   8   302.0      140.0      3449.      10.5   70  1	"ford torino"
15.0   8   429.0      198.0      4341.      10.0   70  1	"ford galaxie 500"
14.0   8   454.0      220.0      4354.       9.0   70  1	"chevrolet impala"
14.0   8   440.0      215.0      4312.       8.5   70  1	"plymouth fury iii"
14.0   8   455.0      225.0      4425.      10.0   70  1	"pontiac catalina"
15.0   8   390.0      190.0      3850.       8.5   70  1	"amc ambassador dpl"
15.0   8   383.0      170.0      3563.      10.0   70  1	"dodge challenger se"
14.0   8   340.0      160.0      3609.       8.0   70  1	"plymouth 'cuda 340"
15.0   8   400.0      150.0      3761.       9.5   70  1	"chevrolet monte carlo"
14.0   8   455.0      225.0      3086.      10.0   70  1	"buick estate wagon (sw)"
24.0   4   113.0      95.00      2372.      15.0   70  3	"toyota corona mark ii"
22.0   6   198.0      95.00      2833.      15.5   70  1	"plymouth duster"
18.0   6   199.0      97.00      2774.      15.5   70  1	"amc hornet"
21.0   6   200.0      85.00      2587.      16.0   70  1	"ford maverick"
27.0   4   97.00      88.00      2130.      14.5   70  3	"datsun pl510"
26.0   4   97.00      46.00      1835.      20.5   70  2	"volkswagen 1131 deluxe sedan"
25.0   4   110.0      87.00      2672.      17.5   70  2	"peugeot 504"
24.0   4   107.0      90.00      2430.      14.5   70  2	"audi 100 ls"
25.0   4   104.0      95.00      2375.      17.5   70  2	"saab 99e"
26.0   4   121.0      113.0      2234.      12.5   70  2	"bmw 2002"
21.0   6   199.0      90.00      2648.      15.0   70  1	"amc gremlin"
10.0   8   360.0      215.0      4615.      14.0   70  1	"ford f250"
10.0   8   307.0      200.0      4376.      15.0   70  1	"chevy c20"
11.0   8   318.0      210.0      4382.      13.5   70  1	"dodge d200"
9.0    8   304.0      193.0      4732.      18.5   70  1	"hi 1200d"
27.0   4   97.00      88.00      2130.      14.5   71  3	"datsun pl510"
28.0   4   140.0      90.00      2264.      15.5   71  1	"chevrolet vega 2300"
25.0   4   113.0      95.00      2228.      14.0   71  3	"toyota corona"
25.0   4   98.00      ?          2046.      19.0   71  1	"ford pinto"
19.0   6   232.0      100.0      2634.      13.0   71  1	"amc gremlin"
16.0   6   225.0      105.0      3439.      15.5   71  1	"plymouth satellite custom"
17.0   6   250.0      100.0      3329.      15.5   71  1	"chevrolet chevelle malibu"
19.0   6   250.0      88.00      3302.      15.5   71  1	"ford torino 500"
18.0   6   232.0      100.0      3288.      15.5   71  1	"amc matador"
14.0   8   350.0      165.0      4209.      12.0   71  1	"chevrolet impala"
14.0   8   400.0      175.0      4464.      11.5   71  1	"pontiac catalina brougham"

Instead of relying on the pandas package to read this data, let's try using the simpler numpy.loadtxt function. We see that each line contains a sample consisting of 8 numbers and one string. However, first we have to figure out how to deal with the missing values. The documentation for numpy.loadtxt reveals that the converters argument can be used to specify a function to apply to the values from particular columns of data. We can use this to deal with the missing values. Looking at auto-mpg.data we see that the missing values are marked with question marks. These are all in the fourth column, the horsepower attribute.

Here is a simple function that translates '?' to NaN (np.nan) and converts anything else to a floating point number.


In [4]:
def missingIsNan(s):
    if s == b'?':  # values read from the file arrive as bytes, so compare with b'?' rather than '?'
        return np.nan
    else:
        return float(s)
    
print(missingIsNan('12.32'))
print(missingIsNan(b'?'))


12.32
nan

Hey, why not get fancy and write a one-liner?


In [5]:
def missingIsNan(s):
    return np.nan if s == b'?' else float(s)

print(missingIsNan('12.32'))
print(missingIsNan(b'?'))


12.32
nan

The converters argument to np.loadtxt accepts a python dictionary, which is python's associative array structure.


In [6]:
d = {1: 'a', 2: 'yo', 3: 42, 4: missingIsNan}
d


Out[6]:
{1: 'a', 2: 'yo', 3: 42, 4: <function __main__.missingIsNan>}

In [7]:
d[1]


Out[7]:
'a'

In [8]:
d[4]


Out[8]:
<function __main__.missingIsNan>

In [9]:
d[4](b'?')


Out[9]:
nan

Let's also restrict np.loadtxt to reading just the first 8 columns to avoid dealing with the string in the 9th column.


In [10]:
data = np.loadtxt('auto-mpg.data', usecols=range(8), converters={3: missingIsNan})

In [11]:
data.shape


Out[11]:
(398, 8)

In [12]:
data[:3,:]


Out[12]:
array([[  1.80000000e+01,   8.00000000e+00,   3.07000000e+02,
          1.30000000e+02,   3.50400000e+03,   1.20000000e+01,
          7.00000000e+01,   1.00000000e+00],
       [  1.50000000e+01,   8.00000000e+00,   3.50000000e+02,
          1.65000000e+02,   3.69300000e+03,   1.15000000e+01,
          7.00000000e+01,   1.00000000e+00],
       [  1.80000000e+01,   8.00000000e+00,   3.18000000e+02,
          1.50000000e+02,   3.43600000e+03,   1.10000000e+01,
          7.00000000e+01,   1.00000000e+00]])

We can change the precision that ipython uses to display floating point values.


In [13]:
%precision 3


Out[13]:
'%.3f'

In [14]:
data[:3,:]


Out[14]:
array([[  1.800e+01,   8.000e+00,   3.070e+02,   1.300e+02,   3.504e+03,
          1.200e+01,   7.000e+01,   1.000e+00],
       [  1.500e+01,   8.000e+00,   3.500e+02,   1.650e+02,   3.693e+03,
          1.150e+01,   7.000e+01,   1.000e+00],
       [  1.800e+01,   8.000e+00,   3.180e+02,   1.500e+02,   3.436e+03,
          1.100e+01,   7.000e+01,   1.000e+00]])

As we have done before, we must find the missing values. Let's just remove the samples with missing values.


In [15]:
np.isnan(data)


Out[15]:
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ..., 
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]], dtype=bool)

In [16]:
np.sum(np.isnan(data))


Out[16]:
6

In [17]:
np.sum(np.isnan(data), axis=0)


Out[17]:
array([0, 0, 0, 6, 0, 0, 0, 0])

What does this result tell us?

Yep, all 6 NaN values are in the fourth column, the horsepower column.

Let's just remove those 6 samples from the array. To do this, build a boolean mask, of length equal to the number of rows in data, that is False except for rows that contain a NaN. We can use the np.any or np.all methods to combine the boolean values in each row or column. First, always try such ideas in multiple steps, checking each step.


In [18]:
nans = np.isnan(data)
nans.shape


Out[18]:
(398, 8)

In [19]:
nans.any(axis=0).shape


Out[19]:
(8,)

In [20]:
nans.any(axis=1).shape


Out[20]:
(398,)

That's the one we want, a boolean value for each row, or sample.


In [21]:
goodRowsMask = nans.any(axis=1)
goodRowsMask


Out[21]:
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False,  True, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False], dtype=bool)

In [22]:
dataNew = data[goodRowsMask,:]
dataNew.shape


Out[22]:
(6, 8)

Wait a minute! This gives us only 6 samples. We wanted all samples but these 6, right?


In [23]:
dataNew


Out[23]:
array([[  2.500e+01,   4.000e+00,   9.800e+01,         nan,   2.046e+03,
          1.900e+01,   7.100e+01,   1.000e+00],
       [  2.100e+01,   6.000e+00,   2.000e+02,         nan,   2.875e+03,
          1.700e+01,   7.400e+01,   1.000e+00],
       [  4.090e+01,   4.000e+00,   8.500e+01,         nan,   1.835e+03,
          1.730e+01,   8.000e+01,   2.000e+00],
       [  2.360e+01,   4.000e+00,   1.400e+02,         nan,   2.905e+03,
          1.430e+01,   8.000e+01,   1.000e+00],
       [  3.450e+01,   4.000e+00,   1.000e+02,         nan,   2.320e+03,
          1.580e+01,   8.100e+01,   2.000e+00],
       [  2.300e+01,   4.000e+00,   1.510e+02,         nan,   3.035e+03,
          2.050e+01,   8.200e+01,   1.000e+00]])

So, let's change all False values to True and True values to False.


In [24]:
goodRowsMask = np.logical_not(goodRowsMask)
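
Equivalently, numpy's ~ operator inverts a boolean array, so the mask could have been built in one step:

goodRowsMask = ~nans.any(axis=1)   # True only for rows with no NaN values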

In [25]:
dataNew = data[goodRowsMask,:]
dataNew.shape


Out[25]:
(392, 8)

In [26]:
dataNew[:3,:]


Out[26]:
array([[  1.800e+01,   8.000e+00,   3.070e+02,   1.300e+02,   3.504e+03,
          1.200e+01,   7.000e+01,   1.000e+00],
       [  1.500e+01,   8.000e+00,   3.500e+02,   1.650e+02,   3.693e+03,
          1.150e+01,   7.000e+01,   1.000e+00],
       [  1.800e+01,   8.000e+00,   3.180e+02,   1.500e+02,   3.436e+03,
          1.100e+01,   7.000e+01,   1.000e+00]])

In [27]:
np.sum(np.isnan(dataNew),axis=0)


Out[27]:
array([0, 0, 0, 0, 0, 0, 0, 0])

Remember, the next step after reading data into python is to visualize it. One thing we can do is just plot the value of each attribute in a separate graph. Let's make an array of column names to label the y axes.


In [28]:
names =  ['mpg','cylinders','displacement','horsepower','weight',
          'acceleration','year','origin']

In [29]:
plt.figure(figsize=(10,10))
nrow,ncol = dataNew.shape
for c in range(ncol):
    plt.subplot(3,3, c+1)
    plt.plot(dataNew[:,c])
    plt.ylabel(names[c])


What is interesting to you in these graphs?

Let's try to predict miles per gallon, mpg, from the other attributes. To set this up, let's make a 392 x 1 column vector, T, of target values containing all of the mpg values, and a 392 x 7 matrix, X, to hold the inputs to our model.


In [30]:
names


Out[30]:
['mpg',
 'cylinders',
 'displacement',
 'horsepower',
 'weight',
 'acceleration',
 'year',
 'origin']

In [31]:
T = dataNew[:, 0:1]  # dataNew[:,0] gives a one-dimensional array; dataNew[:,0:1] preserves the two-dimensional shape

In [32]:
X = dataNew[:, 1:]

In [33]:
X.shape, T.shape


Out[33]:
((392, 7), (392, 1))

In [34]:
Xnames = names[1:]
Tname = names[0]
Xnames,Tname


Out[34]:
(['cylinders',
  'displacement',
  'horsepower',
  'weight',
  'acceleration',
  'year',
  'origin'],
 'mpg')

Now, let's see if a linear model makes some sense by plotting the target values versus each of the input variables.


In [35]:
plt.figure(figsize=(10,10))
for c in range(X.shape[1]):
    plt.subplot(3,3, c+1)
    plt.plot(X[:,c], T, 'o', alpha=0.5)
    plt.ylabel(Tname)
    plt.xlabel(Xnames[c])


What do you think? Are there any linear relationships between the individual input variables and the target variable? Do they make sense, given your knowledge of automobiles?

Now, to build the linear model. First, let's tack on a column of constant 1 values to the left side of X. With this addition, our matrix multiplication takes care of the $w_0$ coefficient as described above.


In [36]:
X1 = np.hstack((np.ones((X.shape[0],1)), X))
X.shape, X1.shape


Out[36]:
((392, 7), (392, 8))

In [37]:
X1[:3,:]


Out[37]:
array([[  1.000e+00,   8.000e+00,   3.070e+02,   1.300e+02,   3.504e+03,
          1.200e+01,   7.000e+01,   1.000e+00],
       [  1.000e+00,   8.000e+00,   3.500e+02,   1.650e+02,   3.693e+03,
          1.150e+01,   7.000e+01,   1.000e+00],
       [  1.000e+00,   8.000e+00,   3.180e+02,   1.500e+02,   3.436e+03,
          1.100e+01,   7.000e+01,   1.000e+00]])

And, let's add a name to Xnames.


In [38]:
Xnames.insert(0, 'bias')
Xnames


Out[38]:
['bias',
 'cylinders',
 'displacement',
 'horsepower',
 'weight',
 'acceleration',
 'year',
 'origin']

We could try to fit a linear model to all of the data and check to see how accurately we predict mpg for each sample. However, this will give a much too optimistic expectation of how well the model will do on new data.

A much better way to evaluate a model is to remove, or hold out, some of the samples from the data set used to fit the model. Then apply the model to the held out samples and compare the predicted mpg with the actual mpg. Call these held out samples the test set and the other samples used to fit the model the train set.

How should we partition the data into these two disjoint subsets? A common practice is to randomly select 80% of the samples as training samples and use the remaining 20% of the samples as testing samples.

To partition our samples (rows of X and T) into training and test sets, let's deal with just the row indices. Then we will use the same selected row indices to slice out corresponding rows of X and T.


In [39]:
nrows = X1.shape[0]
nTrain = int(round(nrows*0.8))
nTest = nrows - nTrain
nTrain,nTest,nTrain+nTest


Out[39]:
(314, 78, 392)

In [40]:
rows = np.arange(nrows)
np.random.shuffle(rows)
rows


Out[40]:
array([163, 144,  71,  43, 111, 274, 254,  13, 202, 161, 220, 224, 232,
       290,  83, 196, 262, 252, 342, 216, 126,   3,  82, 335, 121, 381,
       350,  73,  76, 368,  41, 315, 178,  10, 374,   2, 166, 310, 280,
        75, 172, 186, 151, 324,  64, 353, 113, 332,  45,  51,  47, 104,
       366, 139, 365, 389, 106, 377,  29, 259, 105, 150, 364,  12, 123,
        98, 131,  78,  14, 358, 209, 298, 387, 317, 313, 263, 137, 248,
        42, 108,   8, 174, 228,  94, 206,  44, 362, 107, 181, 231, 361,
       349, 295, 363,  27, 301, 212,  16,  99, 155, 225, 109,  58, 320,
       145, 112, 244, 307, 238, 380,  90, 147,  34,  93,  26,  67, 170,
        40, 372, 388, 115, 345, 149, 355, 316,  77, 164, 160, 148, 275,
         4, 390, 100, 239,  92, 119, 359, 219,   9,  74,  48, 378,  89,
        31,  61, 391, 132,  91, 284, 222, 356,  54, 285,  95,  57, 142,
        35, 237,  60, 249, 134, 162, 158, 279, 281, 354, 229, 246, 379,
       143, 221,  22, 309, 257, 376, 337,   1, 173, 179, 300, 127, 159,
       235, 245, 272, 348, 343, 122,  66, 200, 205, 234, 286,  28, 213,
        19,  21, 241,  46,  70, 194, 352,  49,   7, 141, 171,  15,  80,
       271, 269,  50, 258, 129, 146, 260, 128, 360, 296,  97, 326, 215,
       120, 371, 125, 273, 180, 305, 183,  30, 243, 203, 208, 264, 267,
       118, 199,  96, 292,  11,  65, 293, 191, 176, 167, 306, 357, 319,
       185,   6, 287, 341,   5, 135, 242,  52, 299, 266, 288, 187,   0,
       201, 383, 375, 386, 218,  39, 214, 138, 329,  81, 304,  37,  17,
       283,  24, 198, 256, 124, 130,  20, 370, 289, 133,  53, 338, 278,
       182, 291,  72, 282, 247, 346, 153, 334,  68, 168, 311, 250, 385,
       347,  88, 197, 331, 303, 116, 236, 193, 314, 117,  38, 251, 114,
       253,  69, 297, 110, 169, 190, 330,  36, 211, 382,  59, 177, 270,
       165, 207, 261, 226, 223, 312, 321, 157, 327, 369, 195, 339, 367,
       175, 210, 192, 351, 344,  62, 217, 136,  85,  32,  25,  63, 373,
       325,  18, 384, 277, 189, 302, 255, 156,  84, 230, 204,  56, 318,
       102,  23, 233, 227, 308, 154, 101, 152, 103, 340,  86,  79, 328,
       323, 322, 240, 294,  33, 333, 140, 268, 276, 336, 265,  87,  55,
       188, 184])

In [41]:
trainIndices = rows[:nTrain]
testIndices = rows[nTrain:]
trainIndices,testIndices


Out[41]:
(array([163, 144,  71,  43, 111, 274, 254,  13, 202, 161, 220, 224, 232,
        290,  83, 196, 262, 252, 342, 216, 126,   3,  82, 335, 121, 381,
        350,  73,  76, 368,  41, 315, 178,  10, 374,   2, 166, 310, 280,
         75, 172, 186, 151, 324,  64, 353, 113, 332,  45,  51,  47, 104,
        366, 139, 365, 389, 106, 377,  29, 259, 105, 150, 364,  12, 123,
         98, 131,  78,  14, 358, 209, 298, 387, 317, 313, 263, 137, 248,
         42, 108,   8, 174, 228,  94, 206,  44, 362, 107, 181, 231, 361,
        349, 295, 363,  27, 301, 212,  16,  99, 155, 225, 109,  58, 320,
        145, 112, 244, 307, 238, 380,  90, 147,  34,  93,  26,  67, 170,
         40, 372, 388, 115, 345, 149, 355, 316,  77, 164, 160, 148, 275,
          4, 390, 100, 239,  92, 119, 359, 219,   9,  74,  48, 378,  89,
         31,  61, 391, 132,  91, 284, 222, 356,  54, 285,  95,  57, 142,
         35, 237,  60, 249, 134, 162, 158, 279, 281, 354, 229, 246, 379,
        143, 221,  22, 309, 257, 376, 337,   1, 173, 179, 300, 127, 159,
        235, 245, 272, 348, 343, 122,  66, 200, 205, 234, 286,  28, 213,
         19,  21, 241,  46,  70, 194, 352,  49,   7, 141, 171,  15,  80,
        271, 269,  50, 258, 129, 146, 260, 128, 360, 296,  97, 326, 215,
        120, 371, 125, 273, 180, 305, 183,  30, 243, 203, 208, 264, 267,
        118, 199,  96, 292,  11,  65, 293, 191, 176, 167, 306, 357, 319,
        185,   6, 287, 341,   5, 135, 242,  52, 299, 266, 288, 187,   0,
        201, 383, 375, 386, 218,  39, 214, 138, 329,  81, 304,  37,  17,
        283,  24, 198, 256, 124, 130,  20, 370, 289, 133,  53, 338, 278,
        182, 291,  72, 282, 247, 346, 153, 334,  68, 168, 311, 250, 385,
        347,  88, 197, 331, 303, 116, 236, 193, 314, 117,  38, 251, 114,
        253,  69]),
 array([297, 110, 169, 190, 330,  36, 211, 382,  59, 177, 270, 165, 207,
        261, 226, 223, 312, 321, 157, 327, 369, 195, 339, 367, 175, 210,
        192, 351, 344,  62, 217, 136,  85,  32,  25,  63, 373, 325,  18,
        384, 277, 189, 302, 255, 156,  84, 230, 204,  56, 318, 102,  23,
        233, 227, 308, 154, 101, 152, 103, 340,  86,  79, 328, 323, 322,
        240, 294,  33, 333, 140, 268, 276, 336, 265,  87,  55, 188, 184]))

Check that the training and testing sets are disjoint.


In [42]:
np.intersect1d(trainIndices, testIndices)


Out[42]:
array([], dtype=int64)

In [43]:
Xtrain = X1[trainIndices,:]
Ttrain = T[trainIndices,:]
Xtest = X1[testIndices,:]
Ttest = T[testIndices,:]
Xtrain.shape,Ttrain.shape, Xtest.shape,Ttest.shape


Out[43]:
((314, 8), (314, 1), (78, 8), (78, 1))

We want to solve for $\wv$ in the equation $X^T X \wv = X^T T$. This is done with the numpy.linalg.lstsq function. Don't be confused by the two uses of T here: Ttrain and Ttest hold target values, while something.T is the transpose of something.


In [44]:
w = np.linalg.lstsq(np.dot(Xtrain.T,Xtrain), np.dot(Xtrain.T, Ttrain))
w = w[0] # keep only the weights, and discard the other information returned by lstsq
w,len(w)


Out[44]:
(array([[-11.113],
        [ -0.912],
        [  0.025],
        [ -0.018],
        [ -0.006],
        [  0.068],
        [  0.691],
        [  1.281]]), 8)

In [45]:
for wi,name in zip(w.flat,Xnames):
    print('{:8.3f}  {:s}'.format(wi,name))


 -11.113  bias
  -0.912  cylinders
   0.025  displacement
  -0.018  horsepower
  -0.006  weight
   0.068  acceleration
   0.691  year
   1.281  origin

How can you figure out what np.linalg.lstsq is doing? Try finding the source code!


In [46]:
!locate linalg.py


WARNING: The locate database (/var/db/locate.database) does not exist.
To create the database, run the following command:

  sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.locate.plist

Please be aware that the database can take some time to generate; once
the database has been created, this message will no longer appear.
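
If locate is not available, IPython's question-mark operators are another way to look: a single question mark after a name shows its docstring, and a double question mark shows its Python source when available. For example, in a notebook cell:

np.linalg.lstsq?
np.linalg.lstsq??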

In my version I see documentation for lstsq that states

Return the least-squares solution to a linear matrix equation.

Solves the equation `a x = b` by computing a vector `x` that
minimizes the Euclidean 2-norm `|| b - a x ||^2`.  The equation may
be under-, well-, or over- determined (i.e., the number of
linearly independent rows of `a` can be less than, equal to, or
greater than its number of linearly independent columns).  If `a`
is square and of full rank, then `x` (but for round-off error) is
the "exact" solution of the equation.
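
Given that description, we could also hand lstsq the original least-squares problem directly instead of forming the normal equations first, which is generally the numerically safer route. A sketch using the variables above (recent numpy versions may also ask for an explicit rcond argument):

w_direct = np.linalg.lstsq(Xtrain, Ttrain)[0]   # minimizes || Ttrain - Xtrain w ||^2 directly

This should give essentially the same weights as the normal-equations solution above.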

Now that we have a linear model, which is simply $w$, let's use it to predict mpg for some samples.


In [47]:
X1[:4,:]


Out[47]:
array([[  1.000e+00,   8.000e+00,   3.070e+02,   1.300e+02,   3.504e+03,
          1.200e+01,   7.000e+01,   1.000e+00],
       [  1.000e+00,   8.000e+00,   3.500e+02,   1.650e+02,   3.693e+03,
          1.150e+01,   7.000e+01,   1.000e+00],
       [  1.000e+00,   8.000e+00,   3.180e+02,   1.500e+02,   3.436e+03,
          1.100e+01,   7.000e+01,   1.000e+00],
       [  1.000e+00,   8.000e+00,   3.040e+02,   1.500e+02,   3.433e+03,
          1.200e+01,   7.000e+01,   1.000e+00]])

In [48]:
predict = np.dot(X1[:4,:],w)
predict


Out[48]:
array([[ 14.924],
       [ 14.11 ],
       [ 15.199],
       [ 14.942]])

How do these predictions compare with the actual mpg values? We can either make a two column matrix, or use a for loop to print them.


In [49]:
np.hstack((predict, T[:4,:]))


Out[49]:
array([[ 14.924,  18.   ],
       [ 14.11 ,  15.   ],
       [ 15.199,  18.   ],
       [ 14.942,  16.   ]])

In [50]:
print('{:^5} {:^5}'.format('P','T'))
for (p,t) in zip(predict, T[0:4,:]):
    # print(p,t)
    print('{:5.2f} {:5.2f}'.format(p[0],t[0]))


  P     T  
14.92 18.00
14.11 15.00
15.20 18.00
14.94 16.00

Let's try all of the test data and plot the results.


In [51]:
predict = np.dot(Xtest, w)
predict.shape, Ttest.shape


Out[51]:
((78, 1), (78, 1))

In [52]:
plt.plot(predict,Ttest,'o')
plt.xlabel('Predicted MPG')
plt.ylabel('Actual MPG')
# add a 45 degree line
a = max(min(predict),min(Ttest))
b = min(max(predict),max(Ttest))
plt.plot([a,b],[a,b], 'r', linewidth=3,alpha=0.7);


Not too shabby! But, how about a numerical measure of accuracy?

Right, just calculate the root-mean-square error (RMSE).


In [53]:
np.sqrt( np.mean( (np.dot(Xtest,w) - Ttest)**2))


Out[53]:
3.760

This means we are off by roughly 3.8 mpg in our predictions, on average.

Importance of Attributes (Input Variables)

Here again are the weights we learned before for predicting mpg. Can we use these to decide which attributes are most important? Can we remove some from the model?


In [54]:
for wi,name in zip(w.flat,Xnames):
    print('{:8.3f}  {:s}'.format(wi,name))


 -11.113  bias
  -0.912  cylinders
   0.025  displacement
  -0.018  horsepower
  -0.006  weight
   0.068  acceleration
   0.691  year
   1.281  origin

Perhaps year and origin are the most significant. Does this make sense?

A weight's magnitude is determined not only by the strength of the linear relationship between the corresponding input variable and the target, but also by the variable's range.

What are the ranges of the input variables?


In [55]:
for n,mn,mx in zip(Xnames,np.min(X1,axis=0),np.max(X1,axis=0)):
    print('{:>20} {:8.2f} {:8.2f}'.format(n,mn,mx))


                bias     1.00     1.00
           cylinders     3.00     8.00
        displacement    68.00   455.00
          horsepower    46.00   230.00
              weight  1613.00  5140.00
        acceleration     8.00    24.80
                year    70.00    82.00
              origin     1.00     3.00

The weight for the weight attribute has the smallest magnitude, but the range of its values is the largest.

The weight for origin has the largest magnitude, but the range of its values is the smallest.

If we could remove the effect of the ranges on the weight magnitudes, we could use the weight magnitudes to judge the relative importance of each input variable. How?

A common approach is to standardize the input variables by adjusting the values to have zero mean and unit variance.


In [56]:
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Xs.shape


Out[56]:
(392, 7)

In [57]:
Xs.mean(0)


Out[57]:
array([ -2.169e-16,  -2.538e-16,  -3.965e-16,   6.840e-17,   6.241e-15,
        -4.384e-16,   2.334e-16])

In [58]:
Xs.std(0)


Out[58]:
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.])

To do this correctly when partitioning data into training and testing sets, we must always calculate means and standard deviations using only the training set, and use the same means and standard deviations when standardizing the testing set. Remember, you must not use any information about the testing set when building a model. If you do, your test error will be lower than it will be when you truly see new data.

You can do this by storing the means and standard deviations obtained from the training data, and just use them when standardizing the testing data. Don't forget to add the column of 1's after standardizing.


In [59]:
means = np.mean(X,axis=0)
stds = np.std(X,axis=0)
Xs = (X - means) / stds
Xs1 = np.hstack((np.ones((Xs.shape[0],1)), Xs))                
w = np.linalg.lstsq( np.dot(Xs1.T,Xs1), np.dot(Xs1.T, T) )[0]
w


Out[59]:
array([[ 23.446],
       [ -0.841],
       [  2.079],
       [ -0.652],
       [ -5.492],
       [  0.222],
       [  2.762],
       [  1.147]])

Another way is to construct functions for standardizing that include the calculated means and standard deviations as local variables, by using function closures.


In [60]:
def makeStandardize(X):
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    
    def standardize(origX):
        return (origX - means) / stds
    
    def unStandardize(stdX):
        return stds * stdX + means
    
    return (standardize, unStandardize)

Let's start with X again, and tack on the column of 1's after dividing data into training and testing partitions.


In [61]:
Xtrain = X[trainIndices,:]
Ttrain = T[trainIndices,:]
Xtest = X[testIndices,:]
Ttest = T[testIndices,:]

(standardize, unStandardize) = makeStandardize(Xtrain)

XtrainS = standardize(Xtrain)
XtestS = standardize(Xtest)

np.mean(XtrainS,axis=0), np.std(XtrainS,axis=0), np.mean(XtestS,axis=0), np.std(XtestS,axis=0)


Out[61]:
(array([ -1.110e-16,  -2.192e-17,  -1.117e-16,  -2.906e-16,   5.316e-15,
         -7.404e-16,  -1.223e-16]),
 array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.]),
 array([-0.177, -0.092, -0.063, -0.028, -0.003,  0.157,  0.203]),
 array([ 0.985,  0.993,  0.962,  1.067,  0.994,  0.911,  1.075]))

Notice that the means and standard deviations for the testing set are not as close to 0 and 1 as they are for the training set. Why?

Now we can tack on the column of 1's, solve for w, and examine the weight magnitudes.


In [62]:
XtrainS1 = np.hstack((np.ones((XtrainS.shape[0],1)), XtrainS))
XtestS1 = np.hstack((np.ones((XtestS.shape[0],1)), XtestS))
w = np.linalg.lstsq( np.dot(XtrainS1.T,XtrainS1), np.dot(XtrainS1.T, Ttrain))[0]  # see this [0]?
for wi,name in zip(w.flat,Xnames):
    print('{:8.3f}  {:s}'.format(wi,name))


  23.226  bias
  -1.555  cylinders
   2.571  displacement
  -0.699  horsepower
  -5.339  weight
   0.188  acceleration
   2.582  year
   1.012  origin

Now what do you observe about the relative magnitudes? If you had a ton of input variables, this would be easier to see if the weights were sorted by magnitude.


In [63]:
np.abs(w)


Out[63]:
array([[ 23.226],
       [  1.555],
       [  2.571],
       [  0.699],
       [  5.339],
       [  0.188],
       [  2.582],
       [  1.012]])

In [64]:
np.argsort(np.abs(w.flat))


Out[64]:
array([5, 3, 7, 1, 2, 6, 4, 0])

In [65]:
np.argsort(np.abs(w.flat))[::-1]


Out[65]:
array([0, 4, 6, 2, 1, 7, 3, 5])

In [66]:
sortedOrder = np.argsort(np.abs(w.flat))[::-1]
Xnames = np.array(Xnames)
for wi,name in zip(w.flat[sortedOrder],Xnames[sortedOrder]):
    print('{:8.3f}  {:s}'.format(wi,name))


  23.226  bias
  -5.339  weight
   2.582  year
   2.571  displacement
  -1.555  cylinders
   1.012  origin
  -0.699  horsepower
   0.188  acceleration

Multiple Target Components

The beauty of matrix equations now shines. Previously $T$ was an $N \times 1$ matrix of target values, one per sample.

Just add additional columns of target values, one for each target component. So $T$ becomes an $N \times K$ matrix if there are $K$ output components. Each row holds the $K$ target values for one sample.

Looking again at the solution for $\wv$,

$$ \begin{align*} \wv &= (X^T X)^{-1} X^T T \end{align*} $$

we see that this changes $\wv$ from a single column of weights to $K$ columns, one column of weights per output.

Let's use this to predict miles per gallon and horsepower.

First, let's assemble the data for predicting MPG and HP.


In [67]:
X = dataNew[:,[1,2,4,5,6,7]]
T = dataNew[:,[0,3]] 
Tnames = [names[0], names[3]]
X.shape,Xnames,T.shape,Tnames


Out[67]:
((392, 6), array(['bias', 'cylinders', 'displacement', 'horsepower', 'weight',
        'acceleration', 'year', 'origin'], 
       dtype='<U12'), (392, 2), ['mpg', 'horsepower'])

In [68]:
Xtrain = X[trainIndices,:]
Ttrain = T[trainIndices,:]
Xtest = X[testIndices,:]
Ttest = T[testIndices,:]
Xtrain.shape, Ttrain.shape, Xtest.shape, Ttest.shape


Out[68]:
((314, 6), (314, 2), (78, 6), (78, 2))

In [69]:
standardize,_ = makeStandardize(Xtrain)
XtrainS = standardize(Xtrain)
XtestS = standardize(Xtest)
XtrainS1 = np.hstack((np.ones((XtrainS.shape[0],1)), XtrainS))
Xnames = np.array(['bias']+names)[[0,2,3,5,6,7,8]]
Xnames


Out[69]:
array(['bias', 'cylinders', 'displacement', 'weight', 'acceleration',
       'year', 'origin'], 
      dtype='<U12')

In [70]:
XtrainS1.shape,Ttrain.shape


Out[70]:
((314, 7), (314, 2))

In [71]:
w = np.linalg.lstsq( np.dot(XtrainS1.T, XtrainS1), np.dot(XtrainS1.T, Ttrain))[0]
w


Out[71]:
array([[  23.226,  104.955],
       [  -1.454,   -5.624],
       [   2.247,   17.994],
       [  -5.667,   18.187],
       [   0.409,  -12.271],
       [   2.636,   -2.961],
       [   0.933,    4.353]])

In [72]:
Xnames = np.array(Xnames)
for targeti in range(2):
    print('\nTarget {}\n'.format(Tnames[targeti]))
    thisw = w[:,targeti]
    sortedOrder = np.argsort(np.abs(thisw))[::-1]
    for wi,name in zip(thisw[sortedOrder],Xnames[sortedOrder]):
        print('{:8.3f}  {:s}'.format(wi,name))


Target mpg

  23.226  bias
  -5.667  weight
   2.636  year
   2.247  displacement
  -1.454  cylinders
   0.933  origin
   0.409  acceleration

Target horsepower

 104.955  bias
  18.187  weight
  17.994  displacement
 -12.271  acceleration
  -5.624  cylinders
   4.353  origin
  -2.961  year

Now try predicting both mpg and horsepower.


In [73]:
XtestS1 = np.hstack((np.ones((XtestS.shape[0],1)), XtestS))
prediction = np.dot(XtestS1,w)
prediction.shape


Out[73]:
(78, 2)

In [74]:
plt.figure(figsize=(10,10))
for p in range(2):
    plt.subplot(2,1,p+1)
    plt.plot(prediction[:,p],Ttest[:,p],'o')
    plt.xlabel("Predicted " + Tnames[p])
    plt.ylabel("Actual " + Tnames[p])
    a = max(min(prediction[:,p]),min(Ttest[:,p]))
    b = min(max(prediction[:,p]),max(Ttest[:,p]))
    plt.plot([a,b],[a,b],'r',linewidth=3)


How well did we do in terms of RMSE?


In [75]:
rmseTrain = np.sqrt(np.mean((np.dot(XtrainS1,w) - Ttrain)**2,axis=0))
rmseTrain


Out[75]:
array([  3.194,  12.085])

In [76]:
rmseTest = np.sqrt(np.mean((np.dot(XtestS1,w) - Ttest)**2,axis=0))
rmseTest


Out[76]:
array([  3.764,  12.717])

In [77]:
print('Training RMSE: MPG {:4.2f} HP {:4.2f}'.format(*rmseTrain))   #what is that * doing there??
print(' Testing RMSE: MPG {:4.2f} HP {:4.2f}'.format(*rmseTest))


Training RMSE: MPG 3.19 HP 12.08
 Testing RMSE: MPG 3.76 HP 12.72
